[kv_offload+HMA][7/N]: Support register_kv_caches for hybrid models#37853
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces a significant and well-executed refactoring to support KV cache offloading for hybrid models. The introduction of CanonicalKVCaches provides a clean abstraction over different KV cache layouts, improving modularity and simplifying the transfer handler logic. The accompanying test refactoring is also a good improvement.
I have one high-severity comment regarding an inconsistency in the data type of the canonicalized tensor, which could lead to maintenance issues in the future. Addressing this would make the new abstraction more robust and easier to reason about.
NickLucche
left a comment
hey @orozery, left only a few comments here, mainly because I can understand what you're doing, but I am afraid I don't fully understand why you're doing it.
Could you elaborate on why we would need a CanonicalKVCacheTensor here, and how that abstraction makes life easier for you in code/transfer?
    test_shape = attn_backends[layer_name].get_kv_cache_shape(
        num_blocks=1234,
        block_size=16,
        num_kv_heads=1,
        head_size=256,
    )
    num_blocks_logical_dim = test_shape.index(1234)
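The trick in the snippet above is to probe the backend with a sentinel block count and locate it in the returned shape. A minimal self-contained sketch of that probing pattern (the `get_kv_cache_shape` below is a hypothetical stand-in for the attention-backend API, assuming a FlashAttention-style layout):

```python
# Hypothetical stand-in for an attention backend's shape API; the real
# method lives on the backend object (attn_backends[layer_name]).
def get_kv_cache_shape(num_blocks, block_size, num_kv_heads, head_size):
    # FlashAttention-style layout: (2, num_blocks, block_size, heads, head_size)
    return (2, num_blocks, block_size, num_kv_heads, head_size)

# Probe with a sentinel value that cannot collide with the other dims,
# then find which axis holds the number of blocks.
test_shape = get_kv_cache_shape(
    num_blocks=1234, block_size=16, num_kv_heads=1, head_size=256
)
num_blocks_logical_dim = test_shape.index(1234)
print(num_blocks_logical_dim)  # → 1
```

Using a dedicated `get_kv_cache_block_dim` API, as suggested, would make this probing unnecessary.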
there's a get_kv_cache_block_dim API now we can use
Actually I see that I use test_shape a bit below for asserting (2, ...) for flash attention.
    page_size_bytes[layer_name] = layer_kv_cache_spec.page_size_bytes
    unpadded_page_size_bytes[layer_name] = replace(
        layer_kv_cache_spec, page_size_padded=None
    ).page_size_bytes
    ...
    else:
        raise NotImplementedError
is there any way we can make this if-elif-else simpler, e.g. by factoring out the common assignments at the end?

All cases need to assign to tensors_per_block, page_size_bytes and unpadded_page_size_bytes.
To move the assignments out, I would need to introduce a local variable for each of these 3 dictionaries.
I don't see it simplifying things.
But maybe I missed your point...
The offloading connector supports pluggable backends (e.g. a CPU backend, and a file-system backend in the future). One source of complexity for such backends is registering the GPU KV caches. To spare backends all of these complexities, we define this class: it allows a backend to easily use the KV caches without having to deal with any of the above. Before this PR, we handled SOME (but not all, e.g. mamba) of these complexities inside the CPU backend. So basically, the offloading connector takes responsibility for "translating" the complex-layout kv_caches into a simple, canonical, easy-to-work-with form.
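To make the "canonical translation" concrete, here is an illustrative sketch (not the vLLM implementation; NumPy is used in place of torch, and the layout is a hypothetical FlashAttention-style one) of viewing a backend-shaped KV cache as an int8 tensor of shape (num_blocks, page_size_bytes):

```python
import numpy as np

# Hypothetical backend-specific layout: (2, num_blocks, block_size, heads, head_size)
num_blocks, block_size, num_heads, head_size = 4, 16, 2, 8
kv = np.zeros((2, num_blocks, block_size, num_heads, head_size), dtype=np.float16)

# Move the num_blocks axis to the front and flatten everything else, so each
# row holds all the data belonging to one block.
per_block = np.moveaxis(kv, 1, 0).reshape(num_blocks, -1)

# Reinterpret each block's elements as raw bytes: the canonical tensor is
# int8 with shape (num_blocks, page_size_bytes), regardless of the original
# dtype and layout.
canonical = np.ascontiguousarray(per_block).view(np.int8)

page_size_bytes = 2 * block_size * num_heads * head_size * kv.dtype.itemsize
assert canonical.shape == (num_blocks, page_size_bytes)
```

A backend that only sees `canonical` can copy whole pages byte-for-byte without knowing anything about heads, block sizes, or attention-backend layouts.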
NickLucche
left a comment
All cases need to assign to tensors_per_block, page_size_bytes and unpadded_page_size_bytes.

Yes, that's the pattern I would like to factor out, possibly the canonical torch.tensor/raw creation in particular, which is quite verbose.
Anyway, I am unblocking given this is minor; it could maybe simply be wrapped in a util or constructor method for CanonicalKVCaches, in order to keep the core logic in the worker file as lean as possible.
I'll leave it to you to shape it as you see fit for best maintainability.
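One way the suggested factoring could look (purely hypothetical names; this is a sketch of the "wrap it in a constructor method" idea, not the actual code): each if/elif branch computes only the layer-specific values, and a single helper keeps the three dictionaries in sync.

```python
from dataclasses import dataclass, field

@dataclass
class CanonicalKVCaches:
    # The three per-layer dictionaries the review discusses; `add_layer` is
    # a hypothetical helper that assigns to all three in one place.
    tensors_per_block: dict = field(default_factory=dict)
    page_size_bytes: dict = field(default_factory=dict)
    unpadded_page_size_bytes: dict = field(default_factory=dict)

    def add_layer(self, layer_name, tensors, page_size, unpadded_page_size):
        # Single point of assignment, so branches cannot forget one dict.
        self.tensors_per_block[layer_name] = tensors
        self.page_size_bytes[layer_name] = page_size
        self.unpadded_page_size_bytes[layer_name] = unpadded_page_size

caches = CanonicalKVCaches()
caches.add_layer("layer.0", tensors=["kv0"], page_size=4096, unpadded_page_size=4000)
```

This keeps the if-elif-else in the worker file limited to computing values, as the review suggests.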
@orozery let's check CI. Sorry, misclicked on closing somehow >.<
@NickLucche there was an issue with unit tests (incl. nixl connector) that were using …
Hi @orozery, the pre-commit checks have failed. Please run:

    uv pip install "pre-commit>=4.5.1"
    pre-commit install
    pre-commit run --all-files

Then, commit the changes and push to your branch.
This commit extends the offloading connector register_kv_caches function to support KV caches used in hybrid models.

We define a new CanonicalKVCaches class which captures:
1. The unique set of KV cache tensors (as tensors may be shared by multiple layers).
2. A mapping from each group to its relevant KV cache data (given by a tensor pointer + page size).

The canonical tensors are each of dtype int8 and shape (num_blocks, page_size).

This commit also splits the offloading connector unit tests into multiple files.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Excited for this! I think it's very needed for my deepseekv3.2 nvfp4 setup. @orozery ty 🙏

I believe the merging of this PR would solve one of the issues we're working on right now as well.
This PR extends the offloading connector register_kv_caches function to support KV caches used in hybrid models.

We define a new CanonicalKVCaches class which captures:
1. The unique set of KV cache tensors (as tensors may be shared by multiple layers).
2. A mapping from each group to its relevant KV cache data (given by a tensor pointer + page size).

The canonical tensors are each of dtype int8 and shape (num_blocks, page_size).

This PR also splits the offloading connector unit tests into multiple files.
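The "unique set of KV cache tensors" point can be sketched as follows (illustrative only: `id()` stands in for a torch tensor's `data_ptr()`, and the buffers/names are hypothetical):

```python
# Two layers share one underlying buffer, a third has its own, mirroring
# the hybrid-model case where tensors may be shared by multiple layers.
buf_a = bytearray(16)
buf_b = bytearray(16)
kv_caches = {"layer.0": buf_a, "layer.1": buf_a, "layer.2": buf_b}

# Deduplicate by buffer identity; with torch tensors this would key on
# tensor.data_ptr() instead of id().
unique_tensors = {}
for name, tensor in kv_caches.items():
    unique_tensors.setdefault(id(tensor), tensor)

print(len(unique_tensors))  # → 2
```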